Introduction


* Web crawling is a challenging experiment
- Its perceived difficulty hinders non-commercial endeavors

* Industry has been the major player
- Reluctant to disclose actual methodology

* Academic endeavors are limited
- Popular belief that an Internet-wide crawl requires a huge hardware setup
- Most published crawls are rather limited in size and span of the Internet and lack useful details about the crawl
- No standard methodology to compare different crawls
Introduction (2)


* Our IRLbot crawl experiment in 2007 is the largest 
non-commercial crawl of the Internet to this date 
- Collected 7.3B pages in 41 days using a single crawler node 


* Here the objective is to dissect the collected data 
- Analyze Internet-wide coverage, spam avoidance, etc.
- Compare to commercial search engines using a novel method 
Background - Inside a Web Crawler


[Figure: the URL cycle inside a web crawler. Components: rate limiting, duplicate elimination, frontier ranking, DNS cache. Flow labels: crawled URLs, parsed links (Dql), unique URLs (Dqlp), admitted URLs (Dq(lp-1)/h), allowed URLs.]

Legend:
D  # of downloaded pages
q  Fraction of HTML pages
p  Fraction of unseen links
h  # of crawler nodes
S  Crawl rate (pages/sec)



* Forms a cycle where each component has to keep up to sustain the crawl rate S

* Example: IRLbot's duplicate elimination rate was over 100K/s with peak rate S=3K pps, m=h=1
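
As a rough sanity check on the 100K/s figure, the sketch below divides the total number of found URLs reported in the URL statistics (394,619,023,142) by the crawl duration. Treating the crawl as running at a uniform rate for exactly 41 days is an assumption, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def average_rate(total_items: int, days: float) -> float:
    """Average number of items processed per second over a crawl of `days` days."""
    return total_items / (days * SECONDS_PER_DAY)


# Total found URLs from the URL statistics slide; the uniform 41-day
# duration is an assumption made for this back-of-the-envelope check.
found_urls = 394_619_023_142
link_rate = average_rate(found_urls, 41)
print(f"average links entering duplicate elimination: {link_rate:,.0f}/s")
```

The average comes out a little above 100K links per second, which is in line with the rate quoted on the slide.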
Background — Crawler Design (2)


* Crawl design boils down to a trade-off {D/h, S/h, q, l}
- Increase in one typically results in decrease in others

* Different methods of scaling S in existing literature
- Clear trade-off between D and S
- Reduce q by crawling non-HTMLs (Mercator)
- Eliminate dynamic URLs to reduce l (ClueWeb09)
- Replace disk-based duplicate elimination with a RAM-based method (UbiCrawler, WebBase), or revisit the same pages (Internet Archive)

* None of the crawls of at least 50M pages has real-time spam avoidance or global frontier prioritization
- IRLbot uses real-time frontier prioritization
Background — Crawler Design (3)


* Among distributed crawlers, one of the most prominent is ClueWeb09
- Parallelized Apache Nutch to 1600 processors in the Google-NSF-IBM cluster and discarded all dynamic links (i.e., dropping l by 84%)
- Crawled 1B pages in 52 days at average rate 222 pps

* Some IRLbot Configuration and Features
- Used m=h=1 (i.e., one single crawler node, seeded from only www.tamu.edu)
- Highest q and unrestricted l
- Used real-time frontier prioritization based on the PLD graph
- Rate S and D determined by factors outside our control (i.e., university bandwidth)
- Collected D=7.3B pages in 41 days at average rate 2K pps
Page-Level Analysis - URL Cycle


[Figure: URL cycle from admitted URLs to found URLs. Success rates along the download path: 97.8%, 98.9%, 86.8%.]

Stage | URLs | Share
Admitted URLs | 8,267,316,148 |
No DNS | 208,681,137 | 2.5%
Robot errors | 449,243,979 | 5.6%
Blacklisted | 3,281,661 | 0.04%
Connect | 7,606,109,371 | 94.4%
Valid replies | 7,437,281,300 | 97.8%

Unique URLs: 41,502,195,631
Crawlable URLs: 377,995,369,202
Unique edges: 310,361,986,596

Robot error type | URLs affected
Robots.txt disallow | 296,966,591
Network error on robots.txt | 106,638,856
HTTP error on robots.txt | 24,221,293
Robots.txt forbidden | 20,621,185
Robots.txt loop | 612,160
Robots.txt over 64KB | 183,894

Page-Level Analysis — URL Statistics

Stage | URLs | Share of previous stage
URLs with valid host | 8,058,635,011 |
Connect | 7,606,109,371 | 94.4%
Valid replies | 7,437,281,300 | 97.8%
Full download | 7,350,805,233 | 98.9%
HTML 200 OK | 6,380,051,942 | 86.8%

Found URLs (from HTML 200 OK pages): 387,605,655,905

Network error type | URLs affected
Connect fail | 124,752,717
Slow download | …
Page over 4MB | 338,872
Send fail | 9,365
Total network errors | 162,057,287
Page-Level Analysis — URL Statistics

[Figure: the same URL statistics diagram with the HTTP error type table (columns: HTTP error type, URLs affected). Recoverable totals: Non-text/html 86,476,067 (1.1%); HTTP errors 6,770,784 (0.1%).]
Page-Level Analysis — URL Statistics

Stage | URLs | Share of previous stage
Admitted URLs | 8,267,316,148 |
URLs with valid host | 8,058,635,011 |
Connect | 7,606,109,371 | 94.4%
Valid replies | 7,437,281,300 | 97.8%
Full download | 7,350,805,233 | 98.9%
HTML 200 OK | 6,380,051,942 | 86.8%

Lost between stages | Cause | URLs | Share
Admitted URLs to valid host | No DNS | 208,681,137 | 2.5%
Valid host to connect | Robot errors | 449,243,979 | 5.6%
Valid host to connect | Blacklisted | 3,281,661 | 0.04%
Connect to valid replies | Network errors | 162,057,287 | 2.1%
Connect to valid replies | HTTP errors | 6,770,784 | 0.1%
Valid replies to full download | Non-text/html | 86,476,067 | 1.1%
Full download to HTML 200 OK | HTTP errors | 970,753,291 | 13.2%

Found URLs | Count | Share
From HTML 200 OK pages | 387,605,655,905 |
From HTTP error pages | 7,013,367,237 |
Total found URLs | 394,619,023,142 |
Fail checks | 13,291,356,965 | 3.3%
Pass checks | 381,327,666,177 |

Pass checks by extension | URLs | Share
Bad ext | 3,332,296,975 | 0.9%
Unknown ext | 2,561,590,507 | 0.6%
Missing ext | 78,938,236,200 | 20.7%
Dynamic/HTML | 296,495,542,495 | 77.8%

Crawlable URLs: 377,995,369,202
Unique URLs: 41,502,195,631
Unique edges: 310,361,986,596
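
The counts on this slide can be cross-checked against each other. The sketch below is only a consistency check: the numbers are copied from the slide, while the pairing of each error count with a particular stage transition is an assumption, inferred from the fact that the subtractions balance exactly.

```python
# Stage and error counts copied from the URL statistics slide.
admitted = 8_267_316_148
no_dns = 208_681_137
valid_host = 8_058_635_011
blacklisted = 3_281_661
robot_errors = 449_243_979
connect = 7_606_109_371
network_errors = 162_057_287
bad_http_replies = 6_770_784
valid_replies = 7_437_281_300
non_html = 86_476_067
full_download = 7_350_805_233
html_ok = 6_380_051_942


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Assumed pairing of losses with stage transitions (inferred, not stated).
checks = {
    "admitted - no DNS = valid host": admitted - no_dns == valid_host,
    "valid host - blacklisted - robot errors = connect":
        valid_host - blacklisted - robot_errors == connect,
    "connect - network errors - HTTP errors = valid replies":
        connect - network_errors - bad_http_replies == valid_replies,
    "valid replies - non-text/html = full download":
        valid_replies - non_html == full_download,
}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")

print(f"connect / valid host       = {share(connect, valid_host):.1f}%")
print(f"valid replies / connect    = {share(valid_replies, connect):.1f}%")
print(f"HTML 200 OK / full download = {share(html_ok, full_download):.1f}%")
```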
Page-Level Analysis - A Few Notes


* Countering Spam
- Did real-time PLD ranking on the current web graph
- Treated 301/302 as regular links (processed through cycle)
- Detected slow downloads (no data for 60 sec or takes more than 180 sec)
- Detected infinite data stuffing and cut off after 4 MB

* Avoid non-HTMLs
- Only processed pages with "Content-type: text/html" (86.5M discarded objects would take 346 TB in the worst case)
- Transmitted "Accept: text/html" header field, but resulted in only 6.6% reduction, while extension filtering leads to 0.37% (not very effective!)
* The result is 8.3 KB per object
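
The download safeguards listed above can be sketched as a small decision function. This is a minimal illustration, not IRLbot's actual code: the function names and return labels are hypothetical, and the 4 MB cap is taken as decimal megabytes because that is what makes 86.5M objects add up to 346 TB.

```python
MAX_BODY_BYTES = 4 * 1000**2   # 4 MB cut-off (decimal MB assumed)
IDLE_TIMEOUT_S = 60            # "no data for 60 sec"
TOTAL_TIMEOUT_S = 180          # "takes more than 180 sec"


def is_html(content_type):
    """True if the Content-Type header names text/html (parameters ignored)."""
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def should_abort(content_type, bytes_so_far, idle_s, elapsed_s):
    """Return a reason string if the download should be cut off, else None."""
    if not is_html(content_type):
        return "non-html"
    if bytes_so_far > MAX_BODY_BYTES:
        return "over-4MB"
    if idle_s > IDLE_TIMEOUT_S or elapsed_s > TOTAL_TIMEOUT_S:
        return "slow-download"
    return None


# Worst case from the slide: every discarded object is downloaded up to the cap.
discarded_objects = 86_500_000
worst_case_tb = discarded_objects * MAX_BODY_BYTES / 1e12
print(f"worst case: {worst_case_tb:.0f} TB")
```

Checking the header before reading the body is what keeps the cost of a discarded object in the kilobyte range instead of the multi-megabyte worst case.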
A Few Notes (2)


* URL Processing 


- Processed a-href, frame-src and meta-refresh tags. Did not 
follow img tags 


- Checked URLs for correctness and syntax 


- Used a blacklist of non-HTML extensions, which resulted in a 0.37% saving in bandwidth (note for future crawlers)

* Web graph
- Constructed a 3 TB web graph with 310B edges and 41B nodes
- Average crawl depth 12 (compare to 1.8 of ClueWeb09)
- 60% of downloaded pages were dynamic (i.e., contain “?”)
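
A minimal sketch of the two URL checks mentioned on this slide, the extension blacklist and the "dynamic" test. The extension set below is hypothetical (the slides do not list the actual blacklist), and the helper names are illustrative.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical blacklist; the slides do not enumerate the real one.
NON_HTML_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf", ".zip", ".mp3"}


def extension(url: str) -> str:
    """Lower-cased file extension of the URL path, or '' if there is none."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower()


def passes_extension_filter(url: str) -> bool:
    """False only for URLs whose path ends in a blacklisted extension."""
    return extension(url) not in NON_HTML_EXTENSIONS


def is_dynamic(url: str) -> bool:
    """Dynamic in the slide's sense: the URL contains a '?'."""
    return "?" in url


for url in (
    "http://www.tamu.edu/index.php?id=3",
    "http://www.tamu.edu/photo.JPG",
    "http://www.tamu.edu/",
):
    print(url, passes_extension_filter(url), is_dynamic(url))
```

Because the extension is read from the parsed path, a query string such as `?file=a.jpg` does not trigger the blacklist.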
Server-Level Analysis — DNS and Robots

[Figure: host-level funnel from queued URLs to crawlable hosts; the diagram also carries a "DNS expiration" label.]

Queued URLs: 41,502,195,631
Hosts: 641,982,061
PLDs: 89,652,630
DNS lookups: 297,184,517

Stage | Hosts | Share
Unique hosts | 260,113,628 |
DNS error | 89,011,891 | 34%
DNS found | 171,101,737 |
Dead hosts | 14,864,929 |
Live hosts | 156,236,808 |
Robots errors | 10,078,773 |
Crawlable hosts | 146,158,035 |
Robots 404 | 73,776,294 | 51%
Robots 2xx | 72,381,741 | 49%

DNS error type | Hosts
Name error | 65,241,643
Server fail | 22,885,240
Invalid pkt | 513,556
Refused | 241,555
Reserved IP | 128,029
Other | 1,868

Robots error type | Hosts
Receive fail | 3,641,064
401/403 | 3,371,889
Bad HTTP | 2,976,378
Robot loop | 87,318
> 64KB | 1,895
Send fail | 229

Robots 2xx MIME type | Hosts
text/plain | 54,747,900
No MIME | 17,513,363
Other MIME | 120,478


1.5 GB 
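
The robots.txt handling implied by these counts (a size cap, then rule evaluation) can be sketched with Python's standard `urllib.robotparser`. This is an illustration only: the 64 KB cap is taken as 64 × 1024 bytes, the user-agent token is a placeholder, and the slides do not describe how IRLbot itself parses the file.

```python
from urllib.robotparser import RobotFileParser

MAX_ROBOTS_BYTES = 64 * 1024   # files over 64 KB are counted as robot errors


def parse_robots(body: bytes):
    """Return a parser for a fetched robots.txt body, or None if it exceeds the cap."""
    if len(body) > MAX_ROBOTS_BYTES:
        return None
    parser = RobotFileParser()
    parser.parse(body.decode("utf-8", errors="replace").splitlines())
    return parser


rules = parse_robots(b"User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("IRLbot", "http://www.tamu.edu/private/x.html"))
print(rules.can_fetch("IRLbot", "http://www.tamu.edu/index.html"))
```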


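Server-Level Analysis - Bandwidth

Traffic type | Outbound | Inbound
DNS | 23 GB | 37 GB
Robots | 33 GB | 254 GB
HTTP GET | 1.8 TB |
Aborted downloads | | 718 GB
Full downloads | | 143 TB
TCP/IP | 1.1 TB |

Average inbound rate: 320 Mbps

Full download pages | Pages | Share
HTTP errors | 970,753,291 | 13.2%
HTML 200 OK | 6,380,051,942 | 86.8%

Compression (pages) | Pages | Share
HTTP errors, compressed | 65,867,784 | 7%
HTML 200 OK, non-compressed | 5,329,096,697 | 84%

Full download bytes | Bytes | Share
Total | 143,589,123,572,029 |
HTML 200 OK bytes | 134,766,181,356,989 | 93.8%
TCP/IP header bytes | 3,985,467,341,092 |
HTTP header bytes | 2,571,309,459,132 | 1.8%
HTTP error bytes | 2,258,814,609,583 | 1.5%

Compression (bytes) | Bytes | Share
HTTP errors, non-compressed | 2,174,885,272,428 | 96%
HTML 200 OK, compressed | 6,614,286,502,692 | 5%
HTML 200 OK, non-compressed | 128,151,894,854,297 | 95%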
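
The byte totals can be turned into an average inbound rate for comparison with the roughly 320 Mbps shown on the slide. The sketch assumes exactly 41 days of continuous crawling and counts only the full-download bytes, so it is an approximation rather than the authors' own calculation.

```python
# Byte totals copied from the bandwidth slide.
full_download_bytes = 143_589_123_572_029
html_ok_bytes = 134_766_181_356_989

crawl_seconds = 41 * 86_400  # assumes exactly 41 days of crawling

inbound_mbps = full_download_bytes * 8 / crawl_seconds / 1e6
html_share = 100 * html_ok_bytes / full_download_bytes

print(f"average inbound rate: {inbound_mbps:.0f} Mbps")
print(f"HTML 200 OK share of full-download bytes: {html_share:.2f}%")
```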
Internet Coverage


* Can use different measures 
- Collection of crawled 200 OK pages 
- Constructed web graph size 


* Not much information is available in a standardized fashion
- Mercator uses img tags, while UbiCrawler removes frontiers 
- WebBase considers robots.txt as crawled page 


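Dataset | Pages (HTML 200 OK) | Hosts | PLDs | TLDs | Web graph nodes | Web graph edges | Host graph nodes | Host graph edges | PLD graph nodes | PLD graph edges
AltaVista [9] | — | | | | | | | | |
Polybot [36] | 121M | | | | | | | | |
Google [6] | — | | | | | | | | |
Mercator [10] | 429M | ~10M | | | | | | | |
WebFountain [20] | 1B | — | | | | | | | |
WebBase [16] | 98M | 51K | | | | | | | |
ClueWeb09 [19] | 1B | — | — | | | | | | |
IRLbot | 6.3B | 117M | 33M | 256 | 41B | 310B | | | |
UbiCrawler .uk [7] | 105M | 114K | - | | | 3.7B | | | |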
Risak | 1970 28M 12M — 1 | 13B 93B) 
TeaPot .cn [41] | 837M | 16.9M | 790K | | | 43B | | | 790K |
IRLbot .cn | 209M | 3.3M | 539K | | 1.1B | 11.9B | 8.4M | 103M | 711K | 19.7M


1.3B 19.5B | 12.8M 395M 


19.7M 1.1B 




Computer Science, Texas A&M University 
Internet Coverage — TLD Level

* A novel method of comparing crawls
- Reveals crawler budget on different parts of the Internet

* Use site queries (i.e., “site:domain”) to obtain Google's and Yahoo's (now part of Bing) index size
- In 1/2008, they contained … and 37B pages

| TLD | Google | Yahoo | IRLbot | WebBase | ClueWeb |
| TLDs | 255 | 256 | 256 | 174 | 254 |
Internet Coverage — Google Order

[Figure: fraction of pages vs. TLD sequence number, with TLDs in Google order. Panels: (a) Yahoo (all), (b) IRLbot (all), (c) Yahoo (top 40), (d) IRLbot (top 40). Labeled points in (d): .info (#15), .edu (#12), .gov (#24).]
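
The comparison plotted here can be sketched as follows: order the TLDs by their page count in a reference index, then report each crawl's fraction of pages per TLD in that order. The page counts below are invented toy values for illustration only, and the function name is hypothetical.

```python
def fractions_in_reference_order(reference, other):
    """Order TLDs by descending page count in `reference` and return
    (tld, reference fraction, other fraction) tuples in that order."""
    ref_total = sum(reference.values())
    other_total = sum(other.values())
    order = sorted(reference, key=reference.get, reverse=True)
    return [
        (tld, reference[tld] / ref_total, other.get(tld, 0) / other_total)
        for tld in order
    ]


# Toy page counts, made up for this example (not measured data).
google = {"com": 600, "net": 150, "org": 120, "edu": 80, "info": 50}
irlbot = {"com": 500, "net": 100, "org": 150, "edu": 200, "info": 50}

rows = fractions_in_reference_order(google, irlbot)
for seq, (tld, ref_frac, other_frac) in enumerate(rows, start=1):
    print(f"#{seq} .{tld}: reference {ref_frac:.2f}, other {other_frac:.2f}")
```

A TLD whose fraction in the second crawl stands well above the reference curve indicates where that crawler spent a disproportionate share of its budget.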
Extrapolation

* Assume that the crawl is a stochastic process {(X_t, Y_t)} on the Internet, a web graph G(V,E), where the process terminates at t = N ≤ |E| edges

* Define p(t) as the probability that URL Y_t has not been seen before

* Objective: In a larger crawl, can we estimate the number of
- Unique URLs L_N
- Crawled pages C_N

[Figure: probability and # of links/page vs. crawl fraction z (0 to 1)]
Extrapolation (2)


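* Assume that the reference crawl (IRLbot) has K links, U unique links. The unknown crawl (e.g., Google) has N links (r = N/K). What are L_N and C_N?

* Also assume z = t/K and a new function P(z) = p(zK). Thus, the unknown crawler has (using $K \int_0^1 P(z)\,dz = U$ for the reference crawl):

$E[L_N] = K \int_0^r P(z)\,dz = U + K \int_1^r P(z)\,dz$

* With Pareto fit (i.e., $P(z) = \beta z^{-\alpha}$), we get:

$E[L_N] = U + \frac{K\beta}{1-\alpha}\left(r^{1-\alpha} - 1\right)$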
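
The extrapolation can be evaluated numerically. The sketch below assumes the power-law form P(z) = βz^(-α) with α ≠ 1, which gives the closed form U + Kβ(r^(1-α) - 1)/(1-α), and compares it against a midpoint-rule integral. The values of α and β are made up for illustration; the fitted values are not given in the slides. K is taken here as IRLbot's total found URLs, which is also an assumption.

```python
def expected_unique_links(U, K, r, alpha, beta):
    """Closed form of U + K * integral_1^r beta * z**(-alpha) dz, for alpha != 1."""
    if alpha == 1:
        raise ValueError("alpha = 1 needs the logarithmic form of the integral")
    return U + K * beta * (r ** (1 - alpha) - 1) / (1 - alpha)


def expected_unique_links_numeric(U, K, r, alpha, beta, steps=200_000):
    """Same quantity via the midpoint rule, as an independent check."""
    width = (r - 1) / steps
    area = sum(
        beta * (1 + (i + 0.5) * width) ** (-alpha) for i in range(steps)
    ) * width
    return U + K * area


# U from the slides (unique URLs); K assumed equal to total found URLs.
U, K = 41.5e9, 394.6e9
# alpha and beta are illustrative placeholders, not the fitted parameters.
closed = expected_unique_links(U, K, r=40, alpha=0.5, beta=0.1)
approx = expected_unique_links_numeric(U, K, r=40, alpha=0.5, beta=0.1)
print(f"closed form: {closed:.4e}, numeric: {approx:.4e}")
```

At r = 1 the correction term vanishes and the estimate reduces to U, the unique-link count of the reference crawl.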
Extrapolation - Results

Crawl | r | Crawled links N | Crawled pages C_N | E[L_N]
IRLbot 2007 | | 394B | 6.3B |
Google 2008 | 40 | 12T | 256B | 1T
Google 2012 | 1,981 | 592T | … | 30T



Using 20B pages/day (641 Gbps), takes 50 months of crawling 


* How about Host/PLD level graphs in Google 2012?


- With r=1981, Google has 5.2B unique hosts (IRLbot has 
641M), and 90.6M unique PLDs (IRLbot has 89M) 
Conclusion


* Presented IRLbot implementation and experiment in 
detail 
- Discussed the impact of various design choices 
- Provided guidelines for future crawlers 
- Exposed weird/effective spamming techniques 


* Developed new methods for capturing crawl coverage 


* Outlined a simple extrapolation mechanism to infer 
proprietary and undocumented crawls 


- A simple model for crawl growth rate 
Thank you!
Questions?